The Snf2 dataset

The RNA-Seq dataset we will use in this practical has been produced by Gierliński et al, 2015) and (Schurch et al, 2016)).

The dataset is composed of 48 samples of yeast wild-type (WT) strain, and 48 samples of Snf2 knock-out mutant cell line. The prepared RNA-Seq libraries (unstranded) were pooled and sequenced on seven lanes of a single flow-cell on an Illumina HiSeq 2000 resulting in a total of 1 billion 50-bp single-end reads across the 96 samples. RNA-Seq reads have been cleaned, mapped and counted to generated a count data matrix containing \(7126\) rows/genes.

In this tutorial, we will start from the count table generated by the authors of the study, and use different R tools in order to


Loading the dataset

R enables to download data directly from the Web with the download.file function. After, we begin with some verification steps.

The expression matrix and phenotypic information will be loaded into R using the read.table function. Both table will be converted into a data.frame object when loaded into R.

Note that the created tables include column names as factor object.

Reminder: a factor is a vector with levels (categories), which permits an efficient storage and indexing, but can in some cases lead to misleading effects. To circumvent this, we will sometimes have to convert the factor to a vector, with the R command as.vector().

# List of url and file names
url <- "http://jvanheld.github.io/stats_avec_RStudio_EBA/practicals/yeast_2x48_replicates/data"
countFile <- "counts.txt"
phenoFile <- "expDesign.txt"

## Local paths: create a directory to store the data
dataDir <- ("~/jgb71e_practicals/rnaseq_snf2_Schurch_2015/data")
dir.create(dataDir, showWarnings = FALSE, recursive = TRUE)

# Download the data files unless they are already here
setwd(dataDir)

if(!file.exists(countFile))
  download.file(url = file.path(url,countFile), destfile = countFile )

if(!file.exists(phenoFile))
  download.file(url = file.path(url,phenoFile), destfile = phenoFile )

# Load the count table
countTable <- read.table(countFile, header=TRUE, row.names=1)
phenoTable <- read.table(phenoFile, header=TRUE, row.names=1)

# View part of the data/table
countTable[101:108, 1:5]
            WT1   WT2   WT3   WT4   WT5
yal034w-a    78   116   103   113   113
yal035w    2815  4160  2969  4411  4836
yal036c     331   599   348   525   691
yal037c-a     0    37    23    15    51
yal037c-b    70   183   199   208   196
yal037w      22    69    32    58    51
yal038w   21080 18869 14649 28551 20208
yal039c    1541  1447  1318  2251  1501
# Define a strain-specific color for each sample, and add it as a supplementary column
strainColor <- c("WT"="green","Snf2"="orange") # Choose one color per strain
## Define the column of each sample according to its genotype
phenoTable$color[phenoTable$strain == "Snf"] <- strainColor["Snf2"]
phenoTable$color[phenoTable$strain == "WT"] <- strainColor["WT"]

# Check the dimensions of the pheno table
colnames(phenoTable)
[1] "strain" "color" 

Exercice

  1. Find the commands to visualize the first and last line of a data.frame object (clue: similar to bash commands)
View solution|Hide solution


  1. Check the completeness (dimensions) of the data (tables)
View solution|Hide solution


  1. Check the number of replicates in each class using the table function. To better understand what you could do with this function, play with it (varying the parameters) with the following test table.
## Quick test with a small table generated on the flight
testTable <- data.frame(
  ABC = sample(c("A", "B", "C", "D", "E"), replace=TRUE, 20),
  gender = sample(c("boy", "girl", "robot"), replace=TRUE, 20),
  yesno = sample(c("yes", "no"), replace=TRUE, 20))
View solution|Hide solution


  1. Draw a barplot showing the total number of reads in each sample (clue: use colSums function) with sample colored (according to the color column in phenoTable). What can you say from this diagram?
View solution|Hide solution



Descriptive statistics

Basic statistics

Before going further in the analysis, we will compute some descriptive statistics on the dataset. At this stage we only compute statistics per sample, since statistics per gene are meaningful only after library-wise normalization of the counts.

# Min, Max, median (...). 
# Here on the first 4 samples
summary(countTable[,1:4])

Among the reference functions in R, apply (and its derivates) allow to repeated execution of a function to a serie of data, at a noteworthy higher speed tha using for loops. Note that, if working on a matrix/data frame, the apply function requires that you indicate if you want to apply the function to the rows (MARGIN=1) or to the columns (MARGIN=2). Some built-in functions already exist as colMeans, colSums, rowMeans and rowSums. Let’s use apply to get the summary for every column of the count table. It produces a matrix as output that we transpose to obtain samples as rows and the statistics as columns, and move back to a data.frame object.

statsPerSample <- data.frame(t(apply(countTable, 2, summary)))
head(statsPerSample)

We can now add some columns to the stats per sample. For example, here are the library sums and the 5% quantile.

statsPerSample$libsum <- colSums(countTable) # library sums
statsPerSample$perc05 <- apply(countTable, 2, quantile, probs=0.05)

Exercice

Integrate some more columns to statsPerSample: - the 10%, 90% and 95% quantile;

View solution|Hide solution


  • the number of zeros for each column;
View solution|Hide solution


statsPerSample$percZeros <- 100*statsPerSample$zeros/nrow(countTable)


Finally, randomly choose (with the sample() function) 10 samples (rows) of the stats per sample table and look at their statistics.

statsPerSample[sample(1:ncol(countTable), size = 10),]

## We can nicely format the result for the report, using the knitr::kable function
kable(statsPerSample[sample(1:ncol(countTable), size = 10),])

Distributions

The summary only displays a few milestone values (mean, median, quartiles). In order to get a better intuition of the data, we can draw an histogram of all values.

hist(as.matrix(countTable), col="blue", border="white")
Histogram of raw counts. The scale is determined by the gene with the highest count, which is apparently an outlier.

Histogram of raw counts. The scale is determined by the gene with the highest count, which is apparently an outlier.

The histogram is not very informative so far, apparently due to the presence of a few very high count values, that impose a very large scale on the \(X\) axis. We can use a logarithmic transformation to improve the readability. Note that we will add pseudo counts to avoid problems with “zero” counts observed for some genes in some samples.

epsilon <- 1 # pseudo-count to avoid problems with log(0)
hist(as.matrix(log2(countTable + epsilon)), breaks=100, col="blue", border="white",
     main="Histogram of log2-transformed counts per gene", xlab="log2(counts)", ylab="Number of genes")

To get better insights into the distribution per sample, the boxplot offer a good perspective.

boxplot(log2(countTable + epsilon), col=phenoTable$color, pch=".", 
        horizontal=TRUE, cex.axis=0.5,
        las=1, ylab="Samples", xlab="log2(Counts +1)")
Box plots of non-normalized log2(counts) per sample.

Box plots of non-normalized log2(counts) per sample.

Another way to get an intuition of the value distributions is to use the plotDensity function, which draws one density curve for each sample.

# We will require one function from the affy package
if(!require("affy")){
  source("http://bioconductor.org/biocLite.R")
  biocLite("affy")  
}
library(affy)
plotDensity(log2(countTable + epsilon), lty=1, col=phenoTable$color, lwd=2)
grid()
legend("topright", legend=names(strainColor), col=strainColor, lwd=2)
Densities of log2(counts). Each curve corresponds to one sample.

Densities of log2(counts). Each curve corresponds to one sample.

Beware: the R function plotDensity does not display the actual distribution of your values, but a polynomial fit. The representation thus generally looks smoother than the actual data. It is important to realize that, in some particular cases, the fit can lead to extrapolated values which can be misleading.

Scatter plots

Let’s have a look at the scatter plots using the pairs function. We will only represent 10 randomly selected samples.

# Define a function to draw a scatter plot for a pair of variables (samples) with density colors
plotFun <- function(x,y){ 
  dns <- densCols(x,y); 
  points(x,y, col=dns, pch=".", panel.first=grid());  
#  abline(a=0, b=1, col="brown")
  }

# Plot the scatter plot for a few pairs of variables selected at random
set.seed(123) # forces the random number generator to produce fixed results
pairs(log2(countTable[,sample(ncol(countTable),5)] + epsilon), 
      panel=plotFun, lower.panel = NULL)
**Scatter plot of log2-counts for a random selection of samples. **

Scatter plot of log2-counts for a random selection of samples.

The command pairs draws a scatter plot for each pair of columns of the input dataset. The plot shows a fairly good reproducibility between samples of the same type (WT or KO, respectively): all points are aligned along te diagonal, with a relatively wider dispersion at the bottom, corresponding to small number fluctuations.

In contrast, on all the plots comparing a WT and a KO sample, we can see some points (genes) discarding from the diagonal.


Eliminating undetected genes

All genes from genome the S. cerevisiae where quantified. However only a fraction of them where expressed and some of them where to weakly expressed to be detected in any of the sample. As a result the count table may contain rows with only zero values (null counts).

View solution|Hide solution

Selecting random samples

One of the questions that will drive the analysis will be to define the impact of the number of sample on the analysis.

The original study contained 48 replicates per genotype, what happens if we select a smaller number?

In a first time, we will analyse the full dataset with 48 samples per genotype.

In the second part of the tutorial, each attendee of this course will select a given number (e.g. 3, 4, 5, 10, 15, 20, 35, 40, 45, …) and adapt the code below run the analysis with that number of replicates per genotype.

We will at the end then compare the results (number of genes, significance, …) between the full dataset analysis, and the analysis of sub-samples of varying sizes.

We will already select here the subset of yeast samples for our sub-sampling analysis.

## Select a subset of samples from each genotype for the sub-sampling analysis

nb.replicates <- 10 # Each attendee chooses a number (3,4,5,10,15 or 20)

samples.WT <- sample(1:48, size=nb.replicates, replace=FALSE)

# Random sampling of the Snf2 replicates (columns 49 to 96)
samples.Snf2 <- sample(49:96, size=nb.replicates, replace=FALSE)

selected.samples <- c(samples.WT, samples.Snf2)

# Don't forget to update colors
col.pheno.selected <- phenoTable$color[selected.samples]

Differential analysis with DESeq2

In this section we will search for genes whose expression is affected by the genetic invalidation. You will first need to install the DESeq2 bioconductor library then load it.

# Install the library if needed then load it
if(!require("DESeq2")){
  install.packages("lazyeval")
  install.packages("ggplot2")
  
  source("http://bioconductor.org/biocLite.R")
  biocLite("DESeq2")
}
library("DESeq2")

Analysis of the full dataset

We will first analyse together the full dataset. The analysis of subsets of samples will be done as an exercise.


Creating a DESeqDataSet dataset

We will then create a DESeqDataSet using the DESeqDataSetFromMatrix function. Get some help about the DESeqDataSet and have a look at some important accessor methods: counts, conditions, estimateSizeFactors, sizeFactors, estimateDispersions and nbinomTest.

# Use the DESeqDataSetFromMatrix to create a DESeqDataSet object
dds0 <- DESeqDataSetFromMatrix(countData = countTable, colData = phenoTable, design = ~ strain)

## Check the general properties of the DESeq dataset
print(dds0)
class: DESeqDataSet 
dim: 6887 96 
metadata(1): version
assays(1): counts
rownames(6887): 15s_rrna 21s_rrna ... ty(gua)m2 ty(gua)o
rowData names(0):
colnames(96): WT1 WT2 ... Snf47 Snf48
colData names(2): strain color
# What kind of object is it ?
is(dds0)
[1] "DESeqDataSet"               "RangedSummarizedExperiment"
[3] "SummarizedExperiment"       "Vector"                    
[5] "Annotated"                 
isS4(dds0)
[1] TRUE
# What does it contain ?
# The list of slot names
slotNames(dds0)
[1] "design"             "dispersionFunction" "rowRanges"         
[4] "colData"            "assays"             "NAMES"             
[7] "elementMetadata"    "metadata"          
# Get some help about the "CountDataSet" class.
# NOT RUN
#?"DESeqDataSet-class"

Normalization

The normalization procedure (RLE) is implemented through the estimateSizeFactors function.

How is the scaling factor computed ?

Note: this section is somewhat technical. If the underlying mathematics seem a bit too complicated do not worry, the next sections will be practice-oriented.

Given a matrix with \(p\) columns (samples) and \(n\) rows (genes) this function estimates the size factors as follows.

  • Each column element is divided by the geometric means of the rows.
  • For each sample, the median (or, if requested, another location estimator) of these ratios (skipping the genes with a geometric mean of zero) is used as the size factor for this column.

The scaling factor for sample \(j\) is thus obtained as:

\[sf_{j} = median(\frac{K_{g,j}}{(\prod_{j=1}^p K_{g,j})^{1/p}})\]

# Let's implement such a function
# cds is a countDataset
estimSf <- function (cds){
    # Get the count matrix
    cts <- counts(cds)
    
    # Compute the geometric mean
    geomMean <- function(x) prod(x)^(1/length(x))

    # Compute the geometric mean over the line
    gm.mean  <-  apply(cts, 1, geomMean)
    
    # Zero values are set to NA (avoid subsequentcdsdivision by 0)
    gm.mean[gm.mean == 0] <- NA
    
    # Divide each line by its corresponding geometric mean
    # sweep(x, MARGIN, STATS, FUN = "-", check.margin = TRUE, ...)
    # MARGIN: 1 or 2 (line or columns)
    # STATS: a vector of length nrow(x) or ncol(x), depending on MARGIN
    # FUN: the function to be applied
    cts <- sweep(cts, 1, gm.mean, FUN="/")
    
    # Compute the median over the columns
    med <- apply(cts, 2, median, na.rm=TRUE)
    
    # Return the scaling factor
    return(med)
}

Now, check that the results obtained with our function are the same as those produced by DESeq. The method associated with normalization for the “CountDataSet” class is estimateSizeFactors.

# Normalizing using the method for an object of class"CountDataSet" 
dds.norm <-  estimateSizeFactors(dds0)
sizeFactors(dds.norm)
  WT1   WT2   WT3   WT4   WT5   WT6   WT7   WT8   WT9  WT10  WT11  WT12 
0.586 0.925 0.751 1.109 0.883 1.687 0.896 0.940 0.889 0.911 0.643 0.836 
 WT13  WT14  WT15  WT16  WT17  WT18  WT19  WT20  WT21  WT22  WT23  WT24 
1.022 1.100 0.782 0.550 0.885 0.878 0.939 0.631 1.053 1.444 0.870 0.910 
 WT25  WT26  WT27  WT28  WT29  WT30  WT31  WT32  WT33  WT34  WT35  WT36 
1.194 0.974 1.049 1.178 0.877 1.258 0.939 0.644 0.977 0.828 0.769 0.781 
 WT37  WT38  WT39  WT40  WT41  WT42  WT43  WT44  WT45  WT46  WT47  WT48 
0.922 0.647 0.714 1.277 1.063 0.803 0.739 0.691 1.166 0.604 1.272 0.854 
 Snf1  Snf2  Snf3  Snf4  Snf5  Snf6  Snf7  Snf8  Snf9 Snf10 Snf11 Snf12 
1.395 1.086 1.078 1.447 0.875 1.374 1.717 0.967 1.270 0.977 1.186 0.970 
Snf13 Snf14 Snf15 Snf16 Snf17 Snf18 Snf19 Snf20 Snf21 Snf22 Snf23 Snf24 
1.253 1.278 0.906 1.234 1.275 1.118 0.917 1.124 1.199 0.870 0.997 1.674 
Snf25 Snf26 Snf27 Snf28 Snf29 Snf30 Snf31 Snf32 Snf33 Snf34 Snf35 Snf36 
0.911 1.312 0.822 0.966 1.076 0.937 0.859 1.265 1.559 1.133 0.953 0.974 
Snf37 Snf38 Snf39 Snf40 Snf41 Snf42 Snf43 Snf44 Snf45 Snf46 Snf47 Snf48 
1.078 0.988 1.958 1.073 1.225 0.850 1.221 1.257 0.989 1.146 1.229 0.976 
# Now get the scaling factor with our homemade function.cds.norm
head(estimSf(dds0)) 
  WT1   WT2   WT3   WT4   WT5   WT6 
0.543 0.884 0.725 1.062 0.841 1.603 
all(round(estimSf(dds0),6) == round(sizeFactors(dds.norm), 6))
[1] FALSE
# Checking the normalization
par(mfrow=c(2,2),cex.lab=0.7)
boxplot(log2(counts(dds.norm)+epsilon),  col=col.pheno.selected, cex.axis=0.7, 
        las=1, xlab="log2(counts)", horizontal=TRUE, main="Raw counts")
boxplot(log2(counts(dds.norm, normalized=TRUE)+epsilon),  col=col.pheno.selected, cex.axis=0.7, 
        las=1, xlab="log2(normalized counts)", horizontal=TRUE, main="Normalized counts") 
plotDensity(log2(counts(dds.norm)+epsilon),  col=col.pheno.selected, 
            xlab="log2(counts)", cex.lab=0.7, panel.first=grid()) 
plotDensity(log2(counts(dds.norm, normalized=TRUE)+epsilon), col=col.pheno.selected, 
            xlab="log2(normalized counts)", cex.lab=0.7, panel.first=grid()) 
**Impact of the count normalization. **

Impact of the count normalization.

# Restore default parameters
par(mfrow=c(1,1), cex.lab=1)

Modeling read counts

Let us imagine that we would produce a lot of RNA-Seq experiments from the same samples (technical replicates). For each gene \(g\) the measured read counts would be expected to vary rather slighlty around the expected mean and would be probably well modeled using a Poisson distribution. However, when working with biological replicates more variations are intrinsically expected. Indeed, the measured expression values for each genes are expected to fluctuate more importantly, due to the combination of biological and technical factors: inter-individual variations in gene regulation, sample purity, cell-synchronization issues or reponses to environment (e.g. heat-shock).

The Poisson distribution has only one parameter indicating its expected mean : \(\lambda\). The variance of the distribution equals its mean \(\lambda\). Thus in most cases, the Poisson distribution is not expected to fit very well with the count distribution in biological replicates, since we expect some over-dispersion (greater variability) due to biological noise.

As a consequence, when working with RNA-Seq data, many of the current approaches for differential expression call rely on an alternative distribution: the negative binomial (note that this holds true also for other -Seq approaches, e.g. ChIP-Seq with replicates).

What is the negative binomial ?

The negative binomial distribution is a discrete distribution that give us the probability of observing \(x\) failures before a target number of succes \(n\) is obtained. As we will see later the negative binomial can also be used to model over-dispersed data (in this case this overdispersion is relative to the poisson model).

The probability of \(x\) failures before \(n\) success

First, given a Bernouilli trial with a probability \(p\) of success, the negative binomial distribution describes the probability of observing \(x\) failures before a target number of successes \(n\) is reached. In this case the parameters of the distribution will thus be \(p\), \(n\) (in dnbinom function of R, \(n\) and \(p\) are denoted by arguments size and prob respectively).

\[P_{NegBin}(x; n, p) = \binom{x+n-1}{x}\cdot p^n \cdot (1-p)^x = C^{x}_{x+n-1}\cdot p^n \cdot (1-p)^x \]

In this formula, \(p^n\) denotes the probability to observe \(n\) successes, \((1-p)^x\) the probability of \(x\) failures, and the binomial coefficient \(C^{x}_{x+n-1}\) indicates the number of possible ways to dispose \(x\) failures among the \(x+n-1\) trials that precede the last one (the problem statement imposes for the last trial to be a success).

The negative binomial distribution has expected value \(n\frac{q}{p}\) and variance \(n\frac{q}{p^2}\). Some examples of using this distribution in R are provided below.

Particular case: when \(n=1\) the negative binomial corresponds to the the geometric distribution, which models the probability distribution to observe the first success after \(x\) failures: \(P_{NegBin}(x; 1, p) = P_{geom}(x; p) = p \cdot (1-p)^x\).

par(mfrow=c(1,1))

# Some intuition about the negative binomiale parametrized using n and p.
# The simple case, one success (see geometric distribution)
# Let's have a look at the density
p <- 1/6 # the probability of success
n <- 1   # target for number of successful trials

# The density function
plot(0:10, dnbinom(0:10, n, p), type="h", col="blue", lwd=4)
Negative binomial distribution.

Negative binomial distribution.

# the probability of zero failure before one success.
# i.e the probability of success 
dnbinom(0, n , p)
[1] 0.167
# i.e the probability of at most 5 failure before one success. 
sum(dnbinom(0:5, n , p)) # == pnbinom(5, 1, p)
[1] 0.665
# The probability of at most 10 failures before one sucess 
sum(dnbinom(0:10, n , p)) # == pnbinom(10, 1, p)
[1] 0.865
# The probability to have more than 10 failures before one sucess
1-sum(dnbinom(0:10, n , p)) # == 1 - pnbinom(10, 1, p)
[1] 0.135
# With two successes
# The probability of x failure before two success (e.g. two six)
n <- 2
plot(0:30, dnbinom(0:30, n, p), type="h", col="blue", lwd=2,
     main="Negative binomial density",
     ylab="P(x; n,p)",
     xlab=paste("x = number of failures before", n, "successes"))

# Expected value
q <- 1-p
(ev <- n*q/p)
[1] 10
abline(v=ev, col="darkgreen", lwd=2)

# Variance 
(v <- n*q/p^2)
[1] 60
arrows(x0=ev-sqrt(v), y0 = 0.04, x1=ev+sqrt(v), y1=0.04, col="brown",lwd=2, code=3, , length=0.2, angle=20)
Negative binomial distribution.

Negative binomial distribution.

Using mean and dispersion

The second way of parametrizing the distribution is using the mean value \(m\) and the dispersion parameter \(r\) (in dnbinom function of R, \(m\) and \(r\) are denoted by arguments mu and size respectively). The variance of the distribution can then be computed as \(m + m^2/r\).

Note that \(m\) can be deduced from \(n\) and \(p\).

n <- 10
p <- 1/6
q <- 1-p
mu <- n*q/p

all(dnbinom(0:100, mu=mu, size=n) == dnbinom(0:100, size=n, prob=p))
[1] TRUE

Modelling read counts through a negative binomial

To perform diffential expression call DESeq will assume that, for each gene, the read counts are generated by a negative binomial distribution. One problem here will be to estimate, for each gene, the two parameters of the negative binomial distribution: mean and dispersion.

  • The mean will be estimated from the observed normalized counts in both conditions.

  • The first step will be to compute a gene-wise dispersion. When the number of available samples is insufficient to obtain a reliable estimator of the variance for each gene, DESeq will apply a shrinkage strategy, which assumes that counts produced by genes with similar expression level (counts) have similar variance (note that this is a strong assumption). DESeq will regress the gene-wise dispersion onto the means of the normalized counts to obtain an estimate of the dispersion that will be subsequently used to build the binomial model for each gene.

# Performing estimation of dispersion parameter
dds.disp <- estimateDispersions(dds.norm)

# A diagnostic plot which
# shows the mean of normalized counts (x axis)
# and dispersion estimate for each genes
plotDispEsts(dds.disp)



Performing differential expression call

Now that a negative binomial model has been fitted for each gene, the nbinomWaldTest can be used to test for differential expression. The output is a data.frame which contains nominal p-values, as well as FDR values (correction for multiple tests computed with the Benjamini–Hochberg procedure).

alpha <- 0.0001
waldTestResult <- nbinomWaldTest(dds.disp)
resultDESeq2 <- results(waldTestResult, alpha=alpha, pAdjustMethod="BH")

# What is the object returned by nbinomTest()
class(resultDESeq2)
[1] "DESeqResults"
attr(,"package")
[1] "DESeq2"
is(resultDESeq2) # a data.frame
[1] "DESeqResults"    "DataFrame"       "DataTable"       "SimpleList"     
[5] "DataTableORNULL" "List"            "Vector"          "Annotated"      
slotNames(resultDESeq2)
[1] "rownames"        "nrows"           "listData"        "elementType"    
[5] "elementMetadata" "metadata"       
head(resultDESeq2)
log2 fold change (MAP): strain WT vs Snf 
Wald test p-value: strain WT vs Snf 
DataFrame with 6 rows and 6 columns
          baseMean log2FoldChange     lfcSE      stat        pvalue
         <numeric>      <numeric> <numeric> <numeric>     <numeric>
15s_rrna     18.34        -0.1453     0.291    -0.499 0.61806723393
21s_rrna    107.33         0.0843     0.252     0.335 0.73780434012
hra1          2.53         0.7432     0.208     3.569 0.00035825052
icr1        141.57        -0.2149     0.037    -5.816 0.00000000601
lsr1        207.53         0.1322     0.153     0.864 0.38740549424
nme1         26.48        -0.0823     0.131    -0.630 0.52852462875
                 padj
            <numeric>
15s_rrna 0.6738989809
21s_rrna 0.7826231668
hra1     0.0006499904
icr1     0.0000000163
lsr1     0.4497721567
nme1     0.5894985211
# The column names of the data.frame
# Note the column padj 
# contains FDR values (computed Benjamini–Hochberg procedure)
colnames(resultDESeq2)
[1] "baseMean"       "log2FoldChange" "lfcSE"          "stat"          
[5] "pvalue"         "padj"          
# Order the table by decreasing p-valuer
resultDESeq2 <- resultDESeq2[order(resultDESeq2$padj),]
head(resultDESeq2)
log2 fold change (MAP): strain WT vs Snf 
Wald test p-value: strain WT vs Snf 
DataFrame with 6 rows and 6 columns
           baseMean log2FoldChange     lfcSE      stat    pvalue      padj
          <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
yar009c        1283           2.43    0.0464      52.3         0         0
yar071w        1674           4.24    0.0472      89.7         0         0
ybl005w         542           1.24    0.0261      47.7         0         0
ybl005w-b      1449           2.17    0.0389      55.7         0         0
ybl100w-b       454           1.23    0.0289      42.5         0         0
ybr012w-b      1199           2.01    0.0414      48.5         0         0
# Draw an histogram of the p-values
hist(resultDESeq2$padj, breaks=20, col="grey", main="DESeq2 p-value distribution", xlab="DESeq2 P-value", ylab="Number of genes")
Histogram of the p-values reported by DESeq2.

Histogram of the p-values reported by DESeq2.

Volcano plot

alpha <- 0.01 # Threshold on the adjusted p-value
cols <- densCols(resultDESeq2$log2FoldChange, -log10(resultDESeq2$pvalue))
plot(resultDESeq2$log2FoldChange, -log10(resultDESeq2$padj), col=cols, panel.first=grid(),
     main="Volcano plot", xlab="Effect size: log2(fold-change)", ylab="-log10(adjusted p-value)",
     pch=20, cex=0.6)
abline(v=0)
abline(v=c(-1,1), col="brown")
abline(h=-log10(alpha), col="brown")

gn.selected <- abs(resultDESeq2$log2FoldChange) > 2 & resultDESeq2$padj < alpha 
text(resultDESeq2$log2FoldChange[gn.selected],
     -log10(resultDESeq2$padj)[gn.selected],
     lab=rownames(resultDESeq2)[gn.selected ], cex=0.4)
Volcano plot of DESeq2 results. Abcsissa: log2(fold-change). Ordinate: significance ($-log_{10}(P-value)$).

Volcano plot of DESeq2 results. Abcsissa: log2(fold-change). Ordinate: significance (\(-log_{10}(P-value)\)).

Check the expression levels of the most differentially expressed gene

It may be important to check the validity of our analysis by simply assessing the expression level of the most highly differential gene.

gn.most.sign <- rownames(resultDESeq2)[1]
gn.most.diff.val <- counts(dds.norm, normalized=T)[gn.most.sign,]
barplot(gn.most.diff.val, col=col.pheno.selected, main=gn.most.sign, las=2, cex.names=0.5)
Barplot of the counts per sample fr a selected gene.

Barplot of the counts per sample fr a selected gene.

Looking at the results with a MA plot

One popular diagram in dna chip analysis is the M versus A plot (MA plot) between two conditions \(a\) and \(b\). In this representation :

  • M (Minus) is the log ratio of counts calculated for any gene. \[M_g = log2(\bar{x}_{g,a}) - log2(\bar{x}_{g,b})\]
  • A (add) is the average log counts which corresponds to an estimate of the gene expression level. \[A_g = \frac{1}{2}(log2(\bar{x}_g,a) + log2(\bar{x}_g,b))\]
# Draw a MA plot.
# Genes with adjusted p-values below 1% are shown
plotMA(resultDESeq2, colNonSig = "blue", main="MA plot")
abline(h=c(-1:1), col="red")
MA plot. The abcsissa indicates the mean of normalized counts; the ordinate the log2(fold-change).

MA plot. The abcsissa indicates the mean of normalized counts; the ordinate the log2(fold-change).


Hierarchical clustering

To ensure that the selected genes distinguish well between “treated”" and “untreated” condition we will perform a hierachical clustering using the heatmap.2 function from the gplots library.

# We select gene names based on FDR (1%)
gene.kept <- rownames(resultDESeq2)[resultDESeq2$padj <= alpha & !is.na(resultDESeq2$padj)]

# We retrieve the normalized counts for gene of interest
countTable.kept <- log2(countTable + epsilon)[gene.kept, ]
dim(countTable.kept)
[1] 4374   96
# Install the gplots library if needed then load it
if(!require("gplots")){
  install.packages("gplots")
}
library("gplots")

# Perform the hierarchical clustering with
# A distance based on Pearson-correlation coefficient
# and average linkage clustering as agglomeration criteria
heatmap.2(as.matrix(countTable.kept), 
          scale="row", 
          hclust=function(x) hclust(x,method="average"), 
          distfun=function(x) as.dist((1-cor(t(x)))/2), 
          trace="none", 
          density="none", 
          labRow="",
          cexCol=0.7)
Heatmap of the gebes deckared significant with DESeq2. Rows correspond to genes, columns to samples.

Heatmap of the gebes deckared significant with DESeq2. Rows correspond to genes, columns to samples.


Functional enrichment

We will now perform functional enrichment using the list of induced genes. This step will be performed using the gProfileR R library.

library(gProfileR)

resultDESeq2.df <- na.omit(data.frame(resultDESeq2))
induced.sign <- rownames(resultDESeq2.df)[resultDESeq2.df$log2FoldChange >= 2 &  resultDESeq2.df$padj < alpha]
# head(induced.sign)
# names(term.induced)

term.induced <- gprofiler(query=induced.sign, organism="scerevisiae")
term.induced <- term.induced[order(term.induced$p.value),]
# term.induced$p.value
kable(term.induced[1:10,c("term.name",
                      "term.size",
                      "query.size",
                      "overlap.size",
                      "recall",
                      "precision",
                      "p.value", 
                      "intersection")], 
      format.args=c(engeneer=TRUE, digits=3), caption="**Table: functional analysis wit gProfileR. ** ")
Table: functional analysis wit gProfileR.
term.name term.size query.size overlap.size recall precision p.value intersection
52 RNA-DNA hybrid ribonuclease activity 48 84 30 0.357 0.625 0 YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B
8 DNA integration 50 84 30 0.357 0.600 0 YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B
62 RNA-directed DNA polymerase activity 52 84 30 0.357 0.577 0 YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B
55 aspartic-type peptidase activity 54 84 29 0.345 0.537 0 YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B
57 aspartic-type endopeptidase activity 54 84 29 0.345 0.537 0 YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B
51 endoribonuclease activity, producing 5’-phosphomonoesters 63 84 30 0.357 0.476 0 YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B
63 DNA-directed DNA polymerase activity 64 84 30 0.357 0.469 0 YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B
61 DNA polymerase activity 68 84 30 0.357 0.441 0 YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B
48 endonuclease activity, active with either ribo- or deoxyribonucleic acids and producing 5’-phosphomonoesters 71 84 30 0.357 0.423 0 YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B
50 endoribonuclease activity 77 84 30 0.357 0.390 0 YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B

And now using the list of repressed genes.

resultDESeq2.df <- na.omit(data.frame(resultDESeq2))
repressed.sign <- rownames(resultDESeq2.df)[resultDESeq2.df$log2FoldChange <= -2 &  resultDESeq2.df$padj < alpha]
head(repressed.sign)
[1] "ycr025c" "yer081w" "ygr051c" "ygr087c" "ylr111w" "yml122c"
term.repressed <- gprofiler(query=repressed.sign, organism="scerevisiae")
term.repressed <- term.repressed[order(term.repressed$p.value),]
kable(head(term.induced[,c("p.value", "term.name","intersection")], 10))
p.value term.name intersection
52 0 RNA-DNA hybrid ribonuclease activity YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B
8 0 DNA integration YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B
62 0 RNA-directed DNA polymerase activity YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B
55 0 aspartic-type peptidase activity YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B
57 0 aspartic-type endopeptidase activity YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B
51 0 endoribonuclease activity, producing 5’-phosphomonoesters YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B
63 0 DNA-directed DNA polymerase activity YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B
61 0 DNA polymerase activity YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B
48 0 endonuclease activity, active with either ribo- or deoxyribonucleic acids and producing 5’-phosphomonoesters YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B
50 0 endoribonuclease activity YAR009C,YBL005W-B,YBR012W-B,YDR098C-B,YDR210C-D,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YGR161C-D,YHR214C-B,YJR027W,YJR029W,YLR035C-A,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B

Exercise: assess the effect of sample number on differential expression call

Using a loop, randomly select 10 times 2,5,10,15..45 samples from WT and Snf2 KO. Perform differential expression calls and draw a diagram showing the number of differential expressed genes.

## Local paths: create a directory to store the results
resultDir <- ("~/jgb71e_practicals/rnaseq_snf2_Schurch_2015/results")
dir.create(resultDir, showWarnings = FALSE, recursive = TRUE)
setwd(resultDir)



# Export the table with statistics per sample.
write.table(statsPerSample, file=file.path(resultDir, "stats_per_sample.tsv"),
            quote=FALSE, sep="\t", col.names =NA, row.names = TRUE)

# Export the DESeq2 result table
DESeq2Table <- file.path(resultDir, "yeast_Snf2_vs_WT_DESeq2_diff.tsv")
write.table(resultDESeq2, file=DESeq2Table, col.names = NA, row.names = TRUE, sep="\t", quote = FALSE)